fix(sync): prevent concurrent-session rollbacks from destroying writes on in-memory SQLite#989
Merged
Merged
Conversation
…s cannot destroy writes Root cause of the test_sync_entity_circular_relations CI failure (#940, len(entity_b.outgoing_relations) == 0): the in-memory SQLite URL (sqlite+aiosqlite://) falls back to SQLAlchemy's StaticPool, which hands the same DBAPI connection to every concurrently checked-out session. Concurrent asyncio tasks therefore share one SQLite transaction scope. A rollback issued through one session — scoped_session's exception handler, or the pool's reset-on-return at connection checkin (~40 real ROLLBACKs per sync, measured) — also rolls back any other task's executed-but-uncommitted statements. During batch indexing, per-file tasks run concurrently (asyncio.gather bounded by index_entity_max_concurrent). If a sibling session's checkin ROLLBACK lands between one task's relation INSERT and its COMMIT, the relation row is silently erased: no error is raised, the sync reports success. The final sync-level resolve_relations() pass cannot heal this because it only re-queries rows with to_id IS NULL — the destroyed INSERT leaves no row at all. Fix: give MEMORY-type engines a single-connection AsyncAdaptedQueuePool (pool_size=1, max_overflow=0). The lone connection keeps the in-memory database alive for the engine's lifetime (the reason StaticPool was used), while the blocking checkout serializes sessions at transaction granularity, restoring the isolation the repositories assume. File-based SQLite and Postgres engines are unchanged; production never uses MEMORY engines. The regression test pins the invariant directly: a session that rolls back in one task must never destroy another task's uncommitted writes. It fails deterministically against StaticPool and passes with the serialized pool. tests/repository/test_entity_repository.py held a scoped session open while calling a repository method that opens its own session — tolerated on a shared connection, a deadlock under a serialized pool — so the nested call moved out of the session block. Refs #940 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>
… routing tests
Root cause of the test_mcp_sse_forces_local KeyError('FORCE_LOCAL') flake
(#940, Python 3.14 leg): the stdio variants of the mcp routing tests invoke
`bm mcp`, which starts a real background auto-update daemon thread before
mcp_server.run (the tests only mock run). That thread hits PyPI and then
rewrites config.json (auto_update_last_checked_at) with an in-place
Path.write_text, which truncates the file before writing. If that write lands
while a later test's CLI invocation is reading config.json (the app callback's
CliContainer.create()), load_config() sees empty/partial JSON and raises
SystemExit. CliRunner.invoke swallows it, the mocked mcp_server.run never
executes, and the test dies with KeyError on env_at_run['FORCE_LOCAL'] —
exactly the observed CI failure shape. Nothing is 3.14-specific; that leg
only shifted the timing.
The same torn write is user-visible in production: the MCP stdio server
re-reads config.json on mtime change and load_config() exits the process on
invalid JSON if it races a CLI save.
Fix: save_basic_memory_config writes a per-process/per-thread sibling temp
file and publishes it with os.replace, so readers always observe either the
old or the new complete document. The regression test injects an interrupted
write and asserts the published config stays untouched; it fails against the
old in-place write.
Test hardening: the stdio routing tests stub run_auto_update so no PyPI call
or config write leaks across tests, and all four transport tests now assert
result.exit_code == 0 so a future pre-run failure surfaces its real error
instead of a KeyError.
Refs #940
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Failure 1: test_sync_entity_circular_relations — lost relation row (the important one)
Root cause
The in-memory SQLite URL (
sqlite+aiosqlite://) falls back to SQLAlchemy's StaticPool, which hands the same DBAPI connection to every concurrently checked-out session. All concurrent asyncio tasks therefore share one SQLite transaction scope:COMMITby one session commits every sibling task's half-done statements;ROLLBACKby one session silently destroys every sibling task's executed-but-uncommitted statements.Rollbacks are routine, not exceptional: the pool's reset-on-return issues a real DBAPI
ROLLBACKat every connection checkin. A probe counting DBAPI calls during one sync of the two-file circular-relations scenario measured 40 commits and 41 rollbacks on the shared connection.scoped_session's exception handler is a second rollback source (any per-file error swallowed by_run_bounded).During
_index_changed_files→BatchIndexer.index_files, per-file upserts run concurrently (asyncio.gatherbounded byindex_entity_max_concurrent). The losing interleaving:COMMIT(separate aiosqlite roundtrip);add_all_ignore_duplicatesexecutesINSERT INTO relation(B→A, in the gap between A's commit and A's checkin — both are independent await points);ROLLBACK→ B's INSERT is erased;The final sync-level
resolve_relations()pass cannot catch this: "pending" means a relation row withto_id IS NULL, and the destroyed INSERT leaves no row at all. Hencelen(entity_b.outgoing_relations) == 0. #938 exposed the race by removing the multi-second per-entity embedding work that previously kept the upsert tasks from interleaving tightly. The window is microseconds, which is why it fired once on a loaded 2-core runner and never locally (30 pytest runs + 500 in-process iterations clean).Production is not affected: file-based SQLite uses per-session pooled connections (isolated transactions, WAL) and Postgres uses NullPool.
DatabaseType.MEMORYis only used by the test environment viaengine_session_factory— but the unsafe engine configuration is product code indb.py.Fix
_create_sqlite_enginenow gives MEMORY engines a single-connection blocking pool (AsyncAdaptedQueuePool, pool_size=1, max_overflow=0). The lone connection keeps the in-memory database alive for the engine's lifetime (the reason StaticPool was chosen), while the blocking checkout serializes sessions at transaction granularity — restoring the isolation the repositories assume. No sleeps, no retries; nested-session misuse now fails loudly (pool checkout timeout) instead of silently sharing transactions, and one such test (test_delete_entity_with_relations) was un-nested.Evidence
tests/db/test_memory_db_session_isolation.pyfails deterministically on StaticPool (relation count 0 after a sibling session rolls back) and passes with the serialized pool.Failure 2: test_mcp_sse_forces_local — KeyError: 'FORCE_LOCAL' (Python 3.14 leg)
Root cause
The stdio variants of
TestMcpCommandRoutinginvokebm mcp, whose command body starts a real background auto-update daemon thread beforemcp_server.run(the tests only mockrun). That thread hits PyPI (5s timeout) and then rewritesconfig.json(auto_update_last_checked_at) with an in-place truncatingPath.write_text. When that write lands while a later test's CLI invocation readsconfig.json(app callback →CliContainer.create()→load_config()), the reader sees empty/partial JSON andload_config()raisesSystemExit.CliRunner.invokeswallows it, the mockedmcp_server.runnever executes, and the test fails withKeyError: 'FORCE_LOCAL'— exactly the CI shape. Nothing is 3.14-specific; the leg only shifted timing (the test also never inspectedresult, discarding the evidence).The torn write is also a real production race: the MCP stdio server re-reads
config.jsonon mtime change and exits the process on invalid JSON if it races a CLI save.Fix
save_basic_memory_configwrites a per-process/per-thread sibling temp file and publishes viaos.replace— readers always see a complete document. Regression test fault-injects an interrupted write; fails against the old in-place write.run_auto_update(no PyPI call / config write leaking across tests), and all four transport tests assertresult.exit_code == 0so future pre-run failures surface the real error.Verification
uv run pytest tests— 2946 passed, 33 skippeduv run pytest test-int/cli— 69 passeduv run ruff check .(changed files clean; pre-existing hermes findings untouched),uv run ruff format .,uv run ty check src tests test-int— clean🤖 Generated with Claude Code
Fixes #940